Abstract

We consider the channel access problem in a multi-channel opportunistic communication system with
imperfect channel sensing, where the state of each channel evolves as a non independent and identically
distributed (non-i.i.d.) Markov process. This problem can be cast as a restless multi-armed bandit (RMAB)
problem, which is intractable because of its exponential computational complexity. A natural alternative is
the easily implementable myopic policy, which maximizes the immediate reward but ignores the impact of
the current strategy on the future reward. In particular, we analyze a family of generic and practically
important functions, termed g-regular functions and characterized by three axioms, and establish a set of
closed-form structural conditions for the optimality of the myopic policy.

Index Terms

Restless multi-armed bandit (RMAB), myopic policy, opportunistic spectrum access (OSA), imperfect
detection

I. Introduction 

We consider the restless multi-armed bandit (RMAB) problem in the context of an opportunistic
multi-channel communication system in which a user has access to multiple channels but is limited to
sensing and transmitting on only a subset of them at a time. The fundamental problem is how the user can
exploit past observations and the knowledge of the stochastic properties of the channels to maximize its
utility (e.g., expected throughput) by switching channels opportunistically.

The RMAB problem, although well defined, was proved to be PSPACE-hard by Papadimitriou et al. in [1],
and very few results have been reported on the structure of the optimal policy because of its high
complexity. Recently, an alternative approach has attracted extensive research attention: seeking the
myopic policy (also termed the greedy policy), which maximizes the expected immediate reward while
ignoring the impact of the current action on the future. Zhao et al. [2] established the structure of the
myopic sensing policy, analyzed its performance, and partly obtained its optimality for the case of i.i.d.
channels. Ahmad and Liu et al. [3] derived the optimality of the myopic sensing policy for positively
correlated i.i.d. channels when the user is limited to accessing one channel (i.e., k = 1) each time, and
further extended the optimality to the case of sensing multiple i.i.d. channels (k > 1) [4]. In our previous
work [5], we extended the analysis from i.i.d. channels to non-i.i.d. ones, focused on a family of generic
and important utility functions, termed regular functions, and derived closed-form conditions under which
the myopic sensing policy is ensured to be optimal. For the imperfect sensing channel model, Liu and Zhao
et al. [6] proved the optimality of the myopic policy for the case of two channels with a particular utility
function and conjectured it for arbitrary N. In [7], we extended the optimality of the myopic policy for
i.i.d. channels from perfect sensing to imperfect sensing and, as a consequence, derived closed-form
conditions that guarantee the optimality of the myopic sensing policy for arbitrary N and for regular
functions.

Our study presented in this paper builds upon and extends our earlier work [5], [7]. Under the
assumption of imperfect channel observation, we perform an analytical study of the optimality of the
myopic policy for the considered RMAB problem. The contribution of this paper, compared with [5], [7],
is two-fold:

• We further generalize the third axiom in [5] to cover a much larger class of reward functions,
including the logarithmic and exponential functions. The optimality conditions are derived in this
more general setting, with the case in [5] being a special case.

• We derive the optimality condition of the myopic policy with imperfect channel observation and
non-i.i.d. channels. The main technical obstacle we overcome is that, in the imperfect sensing case,
the belief value of a channel depends not only on the channel evolution itself but also on the
observation outcome, which leads to nondeterministic transitions and nonlinear propagation of the
belief vector.

It is worth noting that, despite its vital importance, very little work has been done on the impact of
imperfect observation on the performance of the myopic policy. To our knowledge, [6] and [7] are the
only analyses pertinent to our study in this paper. They both focus on i.i.d. channels, while the analysis
in this paper relaxes this assumption by considering the generic heterogeneous case, which requires an
original analysis of the optimality, as detailed later in the paper. Table I summarizes the related work
on the myopic policy and places the work presented in this paper in context.

TABLE I
Summary of related work on myopic policy of RMAB problem

                          i.i.d. arms        non i.i.d. arms
Perfect observation       [2], [3], [4]      [5]
Imperfect observation     [6], [7]           this paper

The rest of the paper is organized as follows: Our model is formulated in Section II, and then the
g-regular function is introduced in Section III. Section IV studies the optimality of the myopic sensing
policy. Finally, the paper is concluded in Section V.

II. System Model and Problem Formulation 

We consider a multi-channel opportunistic communication system in which the user is allowed to sense
only k ($1 \le k \le N$) of the N channels at each slot t. The transition probabilities of channel i are
$p^{i}_{rs}$, r, s = 0, 1. We assume $p^{i}_{11} > p^{i}_{01}$, $1 \le i \le N$. We denote the set of channels
chosen by the user at slot t by A(t), where $A(t) \subseteq \mathcal{N}$ and |A(t)| = k. We are interested in
the imperfect sensing scenario where channel sensing is subject to errors, i.e., a good channel may be
sensed as a bad one and vice versa. Let $S(t) = [S_1(t), \cdots, S_N(t)]$ denote the channel state vector,
where $S_i(t) \in \{0, 1\}$ is the state of channel i in slot t, and let $S'(t) = \{S'_i(t), i \in A(t)\}$ denote
the sensing outcome vector, where $S'_i(t) = 0$ (1) means that channel i is sensed bad (good) in slot t.
Using such notation, the performance of channel state detection is characterized by two system
parameters: the probability of false alarm $\epsilon_i(t)$ and the probability of miss detection $\delta_i(t)$,
formally defined as follows:

$$\epsilon_i(t) \triangleq \Pr\{S'_i(t) = 0 \mid S_i(t) = 1\},$$
$$\delta_i(t) \triangleq \Pr\{S'_i(t) = 1 \mid S_i(t) = 0\}.$$

In our analysis, we consider the case where $\epsilon_i(t)$ and $\delta_i(t)$ are independent of t and i. More
specifically, we define $\epsilon$ and $\delta$ as the system-wide false alarm rate and miss detection rate. We
assume that the user only transmits over a channel sensed to be good.

We also assume that when the receiver successfully receives a packet from a channel, it sends an
acknowledgement (ACK) to the transmitter over the same channel at the end of the slot. The absence of an
ACK (NACK) signifies that the transmitter either did not transmit over this channel or transmitted while
the channel was busy in this slot. We assume that acknowledgements are received without error since
acknowledgements are always transmitted over idle channels [6].

Obviously, by sensing only k out of N channels, the user cannot observe the state information of
the whole system. Hence, the user has to infer the channel states from its past decision and observation
history so as to make its future decisions. To this end, we define the channel state belief vector
(hereinafter referred to as the belief vector for brevity) $\Omega(t) = \{\omega_i(t), i \in \mathcal{N}\}$, where
$0 \le \omega_i(t) \le 1$ is the conditional probability that channel i is in the good state (i.e., $S_i(t) = 1$)
at slot t given all past states, actions and observations. In order to ensure that the user and its intended
receiver tune to the same channel in each slot, channel selections should be based on the common
observations $\{0\ (\mathrm{NACK}), 1\ (\mathrm{ACK})\}^k$ rather than on the detection outcomes at the
transmitter. Due to the Markovian nature of the channel model, given the action A(t) and the observations
$\{ACK_i(t) \in \{0, 1\} : i \in A(t)\}$, the belief vector can be updated recursively using Bayes' rule as
shown in (1):

$$\omega_i(t+1) = \begin{cases} p^{i}_{11}, & i \in A(t),\ ACK_i(t) = 1 \\ \mathcal{T}_i(\varphi(\omega_i(t))), & i \in A(t),\ ACK_i(t) = 0 \\ \mathcal{T}_i(\omega_i(t)), & i \notin A(t) \end{cases} \qquad (1)$$

Note that the belief update under $ACK_i(t) = 0$ results from the fact that the receiver cannot distinguish
a failed transmission (i.e., a collision with the primary user, which occurs with probability
$\delta(1 - \omega_i(t))$) from no transmission (which occurs with probability
$\epsilon\omega_i(t) + (1 - \delta)(1 - \omega_i(t))$). For convenience, we introduce two operators:

$$\varphi(\omega_i(t)) \triangleq \frac{\epsilon\omega_i(t)}{\epsilon\omega_i(t) + 1 - \omega_i(t)}, \qquad \mathcal{T}_i(\omega_i(t)) \triangleq \omega_i(t)\, p^{i}_{11} + (1 - \omega_i(t))\, p^{i}_{01}. \qquad (2)$$
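The update rule in (1) and the two operators in (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`propagate`, `no_ack_posterior`, `update_belief`) and the list/dict data layout are hypothetical, and the guard for a zero denominator (which only arises when the false alarm rate is 0 and the belief is 1) is an added convention.

```python
from typing import Dict, List, Sequence, Set


def propagate(omega: float, p11: float, p01: float) -> float:
    """Markov propagation of a belief: the second operator in (2)."""
    return omega * p11 + (1.0 - omega) * p01


def no_ack_posterior(omega: float, eps: float) -> float:
    """First operator in (2): belief that a sensed channel was good given no ACK."""
    denom = eps * omega + 1.0 - omega
    if denom == 0.0:
        # Assumed convention: the no-ACK event has probability zero in this corner case.
        return 0.0
    return eps * omega / denom


def update_belief(
    omega: Sequence[float],
    p11: Sequence[float],
    p01: Sequence[float],
    sensed: Set[int],
    ack: Dict[int, int],
    eps: float,
) -> List[float]:
    """Apply the three branches of (1) to every channel."""
    new_belief = []
    for i, w in enumerate(omega):
        if i in sensed and ack[i] == 1:
            # Sensed and acknowledged: the channel was good for sure.
            new_belief.append(p11[i])
        elif i in sensed:
            # Sensed but no ACK: Bayes correction, then Markov propagation.
            new_belief.append(propagate(no_ack_posterior(w, eps), p11[i], p01[i]))
        else:
            # Not sensed: Markov propagation only.
            new_belief.append(propagate(w, p11[i], p01[i]))
    return new_belief
```

For instance, `update_belief([0.5, 0.6, 0.7], [0.8, 0.7, 0.9], [0.2, 0.3, 0.4], {0, 1}, {0: 1, 1: 0}, 0.1)` applies the first branch to channel 0, the second to channel 1 and the third to channel 2.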

Remark. We would like to emphasize that, in contrast to the perfect sensing case where $\omega_i(t+1)$ is a
linear function of $\omega_i(t)$ whether or not channel i is sensed, in the imperfect sensing case the mapping
from $\omega_i(t)$ to $\omega_i(t+1)$ is no longer linear because of the sensing error (cf. the second line of
equation (1)). In addition, Papadimitriou et al. [1] showed that for N arms, even when the active and the
passive transition matrices are deterministic (i.e., every entry is either 0 or 1), computing the optimal
policy is PSPACE-hard, and their proof also shows that deciding whether the optimal reward is non-zero is
PSPACE-hard, hence ruling out any approximation algorithm as well. Unfortunately, the problem considered
in this paper falls into the class without any approximation algorithm, because the belief update of a
channel depends not only on the channel evolution itself but also on the observation outcome, i.e.,
$\omega_i(t+1) = \mathcal{T}_i(\omega_i(t))$ for $i \notin A(t)$ and $\omega_i(t+1) = \mathcal{T}_i(\varphi(\omega_i(t)))$
for $i \in A(t)$, $ACK_i(t) = 0$. Therefore, an original study of the optimality of the myopic sensing
policy is especially required, since these differences make the analysis for the perfect sensing case no
longer applicable to the imperfect sensing case. It should also be noted that the perfect sensing case can
be regarded as a degenerate case with $\epsilon = \delta = 0$.

A sensing policy $\pi$ specifies a sequence of functions $\pi = [\pi_1, \pi_2, \cdots, \pi_T]$, where $\pi_t$ maps
the belief vector $\Omega(t)$ to the action (i.e., the set of channels to sense) A(t) in each slot t:
$\pi_t : \Omega(t) \rightarrow A(t)$, |A(t)| = k.

Given the imperfect sensing context, we are interested in the user's optimization problem of finding the
optimal sensing policy $\pi^*$ that maximizes the expected total discounted reward over a finite horizon:

$$\pi^* = \arg\max_{\pi} E\left[ \sum_{t=1}^{T} \beta^{t-1} R(\pi_t(\Omega(t))) \,\Big|\, \Omega(1) \right] \qquad (3)$$

where $R(\pi_t(\Omega(t)))$ is the reward collected in slot t under the sensing policy $\pi_t$ with the initial
belief vector $\Omega(1)$¹, and $0 \le \beta \le 1$ is the discount factor characterizing the feature that future
rewards are less valuable than the immediate reward. By treating the belief value of each channel as the
state of each arm of a bandit, the user's optimization problem can be cast as a restless multi-armed bandit
problem.

In this paper, we focus on the myopic sensing policy, which is easy to compute and implement and which
maximizes the immediate reward. It is formally defined as follows:

Definition 1 (Myopic Sensing Policy). Let $F(\Omega_A(t)) = E[R(\pi_t(\Omega(t)))]$ denote the expected
immediate reward obtained in slot t under the sensing policy $\pi_t$. The myopic sensing policy
$\widehat{A}(t)$ consists of sensing the k channels that maximize $F(\Omega_A(t))$, i.e.,
$\widehat{A}(t) = \arg\max_{A(t) \subseteq \mathcal{N}} F(\Omega_A(t))$.

In the subsequent analysis, we establish closed-form conditions under which the myopic sensing policy is
guaranteed to be optimal. Before ending this section, we state some structural properties of
$\mathcal{T}_i(\omega_i(t))$ and $\varphi(\omega_i(t))$ that are useful in the subsequent proofs.

¹ If no information on the initial system state is available, each entry of $\Omega(1)$ can be set to the
stationary distribution $\omega^{i}_{0} = \frac{p^{i}_{01}}{1 + p^{i}_{01} - p^{i}_{11}}$, $1 \le i \le N$.

Lemma 1. For any positively correlated channel i (i.e., $p^{i}_{01} < p^{i}_{11}$), the following structural
properties of $\mathcal{T}_i(\omega_i(t))$ hold:

• $\mathcal{T}_i(\omega_i(t))$ is monotonically increasing in $\omega_i(t)$;

• $p^{i}_{01} \le \mathcal{T}_i(\omega_i(t)) \le p^{i}_{11}$, $\forall\, 0 \le \omega_i(t) \le 1$.

Proof: Noticing that $\mathcal{T}_i(\omega_i(t))$ can be written as
$\mathcal{T}_i(\omega_i(t)) = (p^{i}_{11} - p^{i}_{01})\omega_i(t) + p^{i}_{01}$, Lemma 1 holds straightforwardly. ■

Lemma 2. $\varphi(\omega_i(t))$ monotonically increases with $\omega_i(t)$ when $0 \le \epsilon \le 1$.

Proof: Noticing that $\varphi(\omega_i(t)) = \frac{\epsilon\omega_i(t)}{\epsilon\omega_i(t) + 1 - \omega_i(t)}$,
Lemma 2 follows straightforwardly. ■
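Both proofs rest on the sign of a first derivative. The short derivation below spells out that step; it is a worked check rather than part of the original proofs, and it writes $\mathcal{T}_i$ and $\varphi$ for the two operators of (2).

```latex
% Lemma 1: the propagation operator is affine with positive slope.
\frac{d\,\mathcal{T}_i(\omega)}{d\omega} = p^{i}_{11} - p^{i}_{01} > 0,
\qquad \mathcal{T}_i(0) = p^{i}_{01}, \qquad \mathcal{T}_i(1) = p^{i}_{11}.

% Lemma 2: quotient rule applied to phi(omega) = eps*omega / (eps*omega + 1 - omega).
\frac{d\,\varphi(\omega)}{d\omega}
  = \frac{\epsilon\,(\epsilon\omega + 1 - \omega) - \epsilon\omega\,(\epsilon - 1)}
         {(\epsilon\omega + 1 - \omega)^{2}}
  = \frac{\epsilon}{(\epsilon\omega + 1 - \omega)^{2}} \;\ge\; 0 .
```

The derivative of $\varphi$ is strictly positive whenever $\epsilon > 0$; for $\epsilon = 0$ the operator is identically zero wherever it is defined, so it is still non-decreasing.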

III. Axioms

This section defines three axioms characterizing a family of generic and practically important functions
referred to as g-regular functions, which serve as a basis for the further analysis of the structure and
the optimality of the myopic sensing policy. When there is no ambiguity, we drop the time index of
$\omega_i(t)$ and use $\omega_i(t)$ and $\omega_i$ interchangeably.

Axiom 1 (Symmetry). A function $f(\Omega_A) : [0, 1]^k \rightarrow \mathbb{R}$ is symmetrical if for any two
distinct channels i and j, it holds that

$$f(\omega_1, \cdots, \omega_i, \cdots, \omega_j, \cdots, \omega_k) = f(\omega_1, \cdots, \omega_j, \cdots, \omega_i, \cdots, \omega_k).$$

Axiom 2 (Monotonicity). A function $f(\Omega_A) : [0, 1]^k \rightarrow \mathbb{R}$ is monotonically increasing if
it is monotonically increasing in each variable $\omega_i$, i.e.,

$$\omega'_i > \omega_i \Longrightarrow f(\omega_1, \cdots, \omega'_i, \cdots, \omega_k) > f(\omega_1, \cdots, \omega_i, \cdots, \omega_k), \quad \forall\, i \le k.$$

The above axioms are intuitive, with Axiom 1 stating that once the sensing set A is given, the
sensing order does not change the final reward under a symmetrical function f. The following axiom,
however, significantly extends the axiom of decomposability in [5] so as to cover a much larger range
of utility functions.

Axiom 3 (g-Decomposability). A function $f(\Omega_A) : [0, 1]^k \rightarrow \mathbb{R}$ is decomposable if there
exists a continuous and increasing function $g : [0, 1] \rightarrow [0, \infty)$ and a constant c such that for
any $i \le k$ it holds that

$$f(\omega_1, \cdots, \omega_{i-1}, \omega_i, \omega_{i+1}, \cdots, \omega_k) = c \cdot g(\omega_i)\, f(\omega_1, \cdots, \omega_{i-1}, 1, \omega_{i+1}, \cdots, \omega_k) + c \cdot (1 - g(\omega_i))\, f(\omega_1, \cdots, \omega_{i-1}, 0, \omega_{i+1}, \cdots, \omega_k).$$

Axiom 3 on g-decomposability states that $f(\Omega_A)$ can always be decomposed into two terms by
introducing the function g and replacing $\omega_i$ by 0 and 1, respectively. It is insightful to note that
Axiom 3 significantly extends the axiom of decomposability in [5] by covering a much larger range of
utility functions that cannot be covered by the latter, particularly the logarithmic function (e.g.,
$f(\Omega_A) = \sum_{i=1}^{k} \log_a(1 + \omega_i)$ with $a > 1$, for which $g(\omega_i) = \log_2(1 + \omega_i)$ together
with an associated constant c) and the power function (e.g., $f(\Omega_A) = \sum_{i=1}^{k} \omega_i^{a}$ with
$a > 0$, where c = 1 and $g(\omega_i) = \omega_i^{a}$), both of which are widely used in engineering problems.
By setting $g(\omega_i) = \omega_i$ and c = 1, Axiom 3 degenerates to the axiom of decomposability in [5].
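The decomposability identity can be checked numerically for a candidate triple (f, g, c). The sketch below is illustrative only: the helper names are hypothetical, the power exponent 0.5 and the false alarm rate 0.1 are arbitrary example values, and the linear reward is the one-unit-per-good-channel utility used later in the Discussion.

```python
from typing import Callable, Sequence


def power_reward(omegas: Sequence[float], a: float = 0.5) -> float:
    """Power utility: sum of omega_i ** a (a = 0.5 is an arbitrary example)."""
    return sum(w ** a for w in omegas)


def linear_reward(omegas: Sequence[float], eps: float = 0.1) -> float:
    """Linear utility: sum of (1 - eps) * omega_i (eps = 0.1 is an example value)."""
    return sum((1.0 - eps) * w for w in omegas)


def decomposition_gap(
    f: Callable[[Sequence[float]], float],
    g: Callable[[float], float],
    c: float,
    omegas: Sequence[float],
    i: int,
) -> float:
    """Absolute difference between f and its g-decomposition along coordinate i."""
    with_one = list(omegas)
    with_one[i] = 1.0   # omega_i replaced by 1
    with_zero = list(omegas)
    with_zero[i] = 0.0  # omega_i replaced by 0
    rhs = c * g(omegas[i]) * f(with_one) + c * (1.0 - g(omegas[i])) * f(with_zero)
    return abs(f(omegas) - rhs)
```

With `g = lambda w: w ** 0.5` and `c = 1`, the gap for `power_reward` vanishes up to rounding on any belief vector, and the same holds for `linear_reward` with `g = lambda w: w`. Using the identity function as g for the power utility leaves a visible gap, which is why the more general g is needed.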

In the following, we use the above axioms to characterize a family of generic functions, referred to as
g-regular functions, defined as follows.

Definition 2 (g-Regular Function). A function is called g-regular if it satisfies all three axioms.

If the expected reward function F is g-regular, the myopic sensing policy, defined in Definition 1,
consists of sensing the k channels with the largest belief values. In case of a tie, we can sort the
tied channels in descending order of $\omega_i(t+1)$ calculated in (1). The argument is that a larger
$\omega_i(t+1)$ leads to a larger expected payoff in the next slot t + 1. If the tie persists, then the
channels are sorted by their indexes.
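The selection rule with its two tie-breaking levels can be written as a single sort. In this sketch the function name `myopic_action` is hypothetical, and the next-slot beliefs are passed in by the caller; which branch of (1) is used to compute them for tie-breaking is left open here and is an assumption of whoever calls the function.

```python
from typing import List, Sequence


def myopic_action(omega: Sequence[float], next_belief: Sequence[float], k: int) -> List[int]:
    """Return the indexes of the k channels with the largest beliefs.

    Ties are broken first by the larger next-slot belief, then by the smaller index.
    """
    order = sorted(
        range(len(omega)),
        key=lambda i: (-omega[i], -next_belief[i], i),
    )
    return sorted(order[:k])
```

For example, with beliefs `[0.5, 0.7, 0.7, 0.2]` and next-slot beliefs `[0.4, 0.6, 0.65, 0.3]`, choosing `k = 1` selects channel 2, because channels 1 and 2 tie on the current belief and channel 2 has the larger next-slot belief.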

IV. Analysis on Optimality of Myopic Sensing Policy under Imperfect Sensing 

In this section, we establish the closed-form conditions under which the myopic sensing policy achieves
the system optimum under imperfect sensing. To this end, we start by defining a pseudo value function
and studying its structural properties, which are then used to establish the main result on optimality.

A. Pseudo Value Function 

Armed with the three axioms, this section first defines the pseudo value function in the imperfect 
sensing case and then derives several fundamental properties of it, which are crucial in the study on the 
optimality of the myopic sensing policy. We start by giving the formal definition of the pseudo value 
function in the recursive form. 

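Definition 3 (Pseudo Value Function). The pseudo value function, denoted as $W_t(\Omega_A(t))$
($1 \le t \le T$, $t + 1 \le r \le T$), is recursively defined as follows:

$$W_T(\Omega_{\widehat{A}}(T)) = F(\Omega_{\widehat{A}}(T));$$

$$W_r(\Omega_{\widehat{A}}(r)) = F(\Omega_{\widehat{A}}(r)) + \beta \sum_{\mathcal{E} \subseteq \widehat{A}(r)} \Pr(\widehat{A}(r), \mathcal{E})\, W_{r+1}(\Omega_{\mathcal{E}}(r+1)); \qquad (4)$$

$$W_t(\Omega_A(t)) = F(\Omega_A(t)) + \beta \underbrace{\sum_{\mathcal{E} \subseteq A(t)} \Pr(A(t), \mathcal{E})\, W_{t+1}(\Omega_{\mathcal{E}}(t+1))}_{\Gamma(\Omega_A(t))},$$

where $\Omega_{\mathcal{E}}(t+1)$ and $\Omega_{\mathcal{E}}(r+1)$ are generated by $(\Omega(t), A(t), \mathcal{E})$ and
$(\Omega(r), \widehat{A}(r), \mathcal{E})$, respectively, according to (1), and

$$\Pr(\mathcal{M}, \mathcal{E}) \triangleq \prod_{i \in \mathcal{E}} (1 - \epsilon)\omega_i(t) \prod_{j \in \mathcal{M} \setminus \mathcal{E}} [1 - (1 - \epsilon)\omega_j(t)].$$

The pseudo value function gives the expected discounted accumulated reward of the following sensing
policy: in slot t, sense the channels in A(t), and then sense the channels in $\widehat{A}(r)$
($t + 1 \le r \le T$) (i.e., adopt the myopic policy from slot t + 1 to T). If $A(t) = \widehat{A}(t)$, then the
above sensing policy is the myopic sensing policy, with $W_t(\Omega_{\widehat{A}}(t))$ being the total reward
from slot t to T.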
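For small instances the recursion (4) can be evaluated directly by enumerating every set of acknowledged channels. The sketch below is an illustration under explicit assumptions, not the authors' code: the reward is taken to be the linear utility from the Discussion (one unit per channel that is good and sensed good), ties in the myopic step are broken by index only, all function names are hypothetical, and the cost grows as 2^k per slot.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def propagate(w: float, p11: float, p01: float) -> float:
    return w * p11 + (1.0 - w) * p01


def no_ack_posterior(w: float, eps: float) -> float:
    denom = eps * w + 1.0 - w
    return eps * w / denom if denom > 0.0 else 0.0


def ack_probability(omega, sensed, acked, eps) -> float:
    """Pr(M, E): channels in `acked` return an ACK, the other sensed ones do not."""
    prob = 1.0
    for i in sensed:
        q = (1.0 - eps) * omega[i]
        prob *= q if i in acked else 1.0 - q
    return prob


def next_belief(omega, p11, p01, sensed, acked, eps) -> List[float]:
    """Belief update (1) for one realization of the ACK set."""
    out = []
    for i, w in enumerate(omega):
        if i in acked:
            out.append(p11[i])
        elif i in sensed:
            out.append(propagate(no_ack_posterior(w, eps), p11[i], p01[i]))
        else:
            out.append(propagate(w, p11[i], p01[i]))
    return out


def myopic_set(omega: Sequence[float], k: int) -> Tuple[int, ...]:
    """k largest beliefs; ties broken by index only (a simplification)."""
    order = sorted(range(len(omega)), key=lambda i: (-omega[i], i))
    return tuple(sorted(order[:k]))


def reward(omega, sensed, eps) -> float:
    """Assumed linear slot reward: sum of (1 - eps) * omega_i over sensed channels."""
    return sum((1.0 - eps) * omega[i] for i in sensed)


def pseudo_value(omega, sensed, t, T, p11, p01, eps, beta) -> float:
    """Sense `sensed` in slot t, then follow the myopic policy up to slot T."""
    value = reward(omega, sensed, eps)
    if t == T:
        return value
    k = len(sensed)
    for r in range(k + 1):
        for acked in combinations(sensed, r):
            prob = ack_probability(omega, sensed, acked, eps)
            nxt = next_belief(omega, p11, p01, sensed, set(acked), eps)
            value += beta * prob * pseudo_value(
                nxt, myopic_set(nxt, k), t + 1, T, p11, p01, eps, beta
            )
    return value
```

Calling `pseudo_value` with the myopic set as `sensed` gives the value of the myopic policy itself, while passing any other set of size k gives the value of deviating in slot t only, which is the comparison the lemmas of this section bound.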

Lemma 3. If the expected reward function $F(\Omega_A)$ is g-regular, the corresponding pseudo value function
$W_t(\Omega_A(t))$ is symmetrical about $\omega_i, \omega_j$ where $i, j \in A$ or $i, j \notin A$, for all
$t = 1, 2, \cdots, T$.

Proof: The lemma can be shown by backward induction. Noticing that $F(\Omega_A)$ is symmetrical
about $\omega_i, \omega_j$, and that $(\omega_1, \cdots, \omega_i, \cdots, \omega_j, \cdots, \omega_N)$ and
$(\omega_1, \cdots, \omega_j, \cdots, \omega_i, \cdots, \omega_N)$ generate the same belief vector $\Omega(t+1)$
whether $i, j \in A$ or $i, j \notin A$, and combining this with the fact that the myopic policy is adopted from
slot t + 1 to T by (4), we conclude that $W_{t+1}(\Omega_{\mathcal{E}}(t+1))$ is symmetrical about
$\omega_i, \omega_j$. Thus the lemma holds. ■

B. Myopic Sensing Policy: Condition of Optimality 

In this subsection, we study the optimality of the myopic sensing policy. For the convenience of 
discussion, we firstly state some notation before presenting the analysis. 
. = max{plJ, = max{p^i}; 

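• $g'_{\min} \triangleq \min_{\omega} \frac{dg(\omega)}{d\omega}$, $g'_{\max} \triangleq \max_{\omega} \frac{dg(\omega)}{d\omega}$;

• Let $\omega_{-i} \triangleq \{\omega_j : j \in A, j \neq i\}$ denote the belief vector except $\omega_i$, and

$$\Delta_{\max} \triangleq \max_{\omega_{-i} \in [0,1]^{N-1}} \{F(1, \omega_{-i}) - F(0, \omega_{-i})\},$$
$$\Delta_{\min} \triangleq \min_{\omega_{-i} \in [0,1]^{N-1}} \{F(1, \omega_{-i}) - F(0, \omega_{-i})\}.$$

We start by showing the following important lemma (Lemma 4) and then establish the sufficient
condition under which the optimality of the myopic sensing policy is ensured. In Lemma 4, we consider
$\Omega_l = [\omega_1, \cdots, \omega_l, \cdots, \omega_N]$ and $\Omega'_l = [\omega_1, \cdots, \omega'_l, \cdots, \omega_N]$, which
differ only in one element $\omega'_l \ge \omega_l$.

Let A' and A denote the largest k elements in $\Omega'_l$ and $\Omega_l$, respectively². Lemma 4 gives the upper
and lower bounds of $W_t(\Omega_{A'}) - W_t(\Omega_A)$.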

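Lemma 4. If the expected reward function $F(\Omega_A)$ is g-regular, then $\forall\, l \in \mathcal{N}$,
$\omega_l \le \omega'_l$ and $1 \le t \le T$, we have

1) if $l \in A'$ and $l \in A$, then

$$c \cdot (\omega'_l - \omega_l)\, g'_{\min} \Delta_{\min} \le W_t(\Omega_{A'}) - W_t(\Omega_A) \le c \cdot (\omega'_l - \omega_l)\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t} \beta^i (\delta^{\max})^i;$$

2) if $l \notin A'$ and $l \notin A$, then

$$0 \le W_t(\Omega_{A'}) - W_t(\Omega_A) \le c \cdot (\omega'_l - \omega_l)\, g'_{\max} \Delta_{\max} \sum_{i=1}^{T-t} \beta^i (\delta^{\max})^i;$$

3) if $l \in A'$ and $l \notin A$, then

$$0 \le W_t(\Omega_{A'}) - W_t(\Omega_A) \le c \cdot (\omega'_l - \omega_l)\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t} \beta^i (\delta^{\max})^i.$$

Proof: The proof is given in Appendix A. ■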

Remark. It can be noted that the case $l \notin A'$ and $l \in A$ is impossible according to the definition
of the myopic sensing policy.

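In the following lemma, we consider $W_t(\Omega_{A_l})$ and $W_t(\Omega_{A_m})$, where $A_l$ and $A_m$ differ in one
element ($l \in A_l$, $m \in A_m$ and $\omega_l > \omega_m$). Lemma 5 establishes the sufficient condition under
which $W_t(\Omega_{A_l}) > W_t(\Omega_{A_m})$ when F is g-regular.

Lemma 5. If $F(\Omega_A)$ is g-regular and
$\frac{g'_{\min} \Delta_{\min}}{g'_{\max} \Delta_{\max}} > \sum_{i=1}^{T-1} \beta^i (\delta^{\max})^i$, then
$W_t(\Omega_{A_l}) > W_t(\Omega_{A_m})$ holds for $1 \le t \le T$.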

Proof: Let $\Omega'$ denote the set of channel belief values with $\omega'_l = \omega_m$ and $\omega'_i = \omega_i$ for
$\forall\, i \neq l$. Applying Lemma 4, we have

$$W_t(\Omega_{A_l}) - W_t(\Omega_{A_m}) = [W_t(\Omega_{A_l}) - W_t(\Omega')] - [W_t(\Omega_{A_m}) - W_t(\Omega')]$$
$$\ge c \cdot (\omega_l - \omega_m)\, g'_{\min} \Delta_{\min} - c \cdot (\omega_l - \omega_m)\, g'_{\max} \Delta_{\max} \sum_{i=1}^{T-t} \beta^i (\delta^{\max})^i$$
$$\ge c \cdot (\omega_l - \omega_m)\, g'_{\max} \Delta_{\max} \left[ \frac{g'_{\min} \Delta_{\min}}{g'_{\max} \Delta_{\max}} - \sum_{i=1}^{T-1} \beta^i (\delta^{\max})^i \right] > 0$$

if the conditions in the lemma hold. ■

² The tie, if it exists, is resolved in the way stated in the remark after Definition 2.

The following theorem studies the optimality of the myopic sensing policy under imperfect sensing.
The proof is similar to that of Theorem 1 in [5] and is thus omitted here.

Theorem 1. The myopic sensing policy is optimal if the following two conditions hold: (1) the expected
slot reward function F is g-regular; (2)
$\frac{g'_{\min} \Delta_{\min}}{g'_{\max} \Delta_{\max}} > \sum_{i=1}^{T-1} \beta^i (\delta^{\max})^i$.

Theorem 1 generalizes the results with perfect sensing (Theorem 1 in our previous work [5]) in two
aspects. First, with the more generic axiom on the decomposability of the expected slot reward function,
the result can now cover a much larger class of reward functions, including the logarithmic and power
functions, which are widely encountered in practical scenarios. Secondly, Theorem 1 also generalizes the
optimality of the myopic sensing policy to cover the imperfect sensing case.

The following theorem further establishes the optimality conditions in the asymptotic case
$T \rightarrow \infty$. The proof follows straightforwardly from Theorem 1 by noticing that
$\sum_{i=1}^{\infty} x^i = \frac{x}{1 - x}$ for any $x \in (0, 1)$.



Theorem 2. In the infinite horizon case $T \rightarrow \infty$, the myopic sensing policy is optimal if the
following conditions hold: (1) the expected slot reward function F is g-regular; (2)
$\beta < \frac{g'_{\min} \Delta_{\min}}{(g'_{\min} \Delta_{\min} + g'_{\max} \Delta_{\max})\, \delta^{\max}}$.
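The second condition of each theorem is a simple numerical test once the four constants and the largest gap between the transition probabilities are known. The helpers below are a sketch: the function and parameter names are hypothetical, the finite-horizon test uses a strict inequality against a sum of T - 1 geometric terms in the product of the discount factor and that gap, and the infinite-horizon helper returns the resulting upper bound on the discount factor.

```python
def finite_horizon_condition(
    beta: float,
    horizon: int,
    delta_max: float,
    g_min: float,
    g_max: float,
    d_min: float,
    d_max: float,
) -> bool:
    """Check g'_min*D_min / (g'_max*D_max) > sum_{i=1}^{T-1} (beta*delta_max)^i."""
    ratio = (g_min * d_min) / (g_max * d_max)
    geometric = sum((beta * delta_max) ** i for i in range(1, horizon))
    return ratio > geometric


def infinite_horizon_beta_bound(
    delta_max: float,
    g_min: float,
    g_max: float,
    d_min: float,
    d_max: float,
) -> float:
    """Upper bound on beta obtained by letting the horizon grow without limit."""
    return (g_min * d_min) / ((g_min * d_min + g_max * d_max) * delta_max)
```

For the linear utility with a false alarm rate of 0.1 (so that both derivative bounds equal 1 and both reward gaps equal 0.9), a largest transition gap of 0.4 gives a bound of 1.25, so every discount factor in [0, 1] satisfies the condition; a gap of 0.6 with a discount factor of 1 and a horizon of 10 fails the finite-horizon test.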



C. Discussion 



We consider the channel access problem where a user is limited to sensing k of N i.i.d. channels and gets
one unit of reward if the sensed channel is in the good state, i.e., the utility function can be formulated as
$F(\Omega_A) = \sum_{i \in A} (1 - \epsilon)\omega_i$. To that end, we apply Theorem 1 of [7] and have $\Delta_{\min} = \Delta_{\max} = 1 - \epsilon$. We

can then verify that when e < it holds that < ■ ^ r > 1. Therefore, 

when conditions 1 and 2 of Theorem 1 in [7] hold, the myopic sensing policy is always optimal
for any $\beta$, which significantly extends the results obtained in [6]. Regarding the similar scenario with
non-i.i.d. channels, we have c = 1, $g(\omega) = \omega$ and $\Delta_{\min} = \Delta_{\max} = 1 - \epsilon$, and furthermore
know that the myopic policy is optimal for any $\beta$ and $\epsilon$ if $\delta^{\max} < 0.5$ according to Theorem 2.
Compared with the optimality conditions of [7] for i.i.d. channels, although both focus on the optimality
of the myopic policy, the closed-form optimality conditions derived in this paper are much stricter with
respect to the transition probabilities ($\delta^{\max} < 0.5$ in our paper) but much looser in the false alarm
rate (on which [7] imposes an upper bound). The stricter constraint on the transition probabilities is due
to the proposed method itself, which sacrifices part of the optimality to cover the case of non-i.i.d.
channels, while the looser constraint on the sensing error comes from the fact that all the channels are
only discriminated as sensed or non-sensed channels at each slot, under which the sensing error can be
absorbed without any constraint.

V. Conclusion 

We have investigated the optimality of the myopic policy in the RMAB problem with imperfect sensing,
and developed three axioms characterizing a family of generic and practically important functions, which
we refer to as g-regular functions. By performing a mathematical analysis based on the developed axioms,
we have characterized the closed-form conditions under which the optimality of the myopic policy is
guaranteed. As future work, a natural direction we are pursuing is to investigate the RMAB problem with
multiple players with potential conflicts among them and to study the structure and the optimality of
the myopic policy in that context.

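Appendix A
Proof of Lemma 4

We prove the lemma by backward induction.

For slot T, noticing that $W_T(\Omega_A) = F(\Omega_A)$ and that
$g'_{\min} \le \frac{g(\omega'_l) - g(\omega_l)}{\omega'_l - \omega_l} \le g'_{\max}$ for any admissible belief values
$\omega_l < \omega'_l$, we have

1) For $l \in A'$, $l \in A$, it holds that

$$c \cdot (\omega'_l - \omega_l)\, g'_{\min} \Delta_{\min} \le W_T(\Omega_{A'}) - W_T(\Omega_A) \le c \cdot [g(\omega'_l) - g(\omega_l)] \Delta_{\max} \le c \cdot (\omega'_l - \omega_l)\, g'_{\max} \Delta_{\max};$$

2) For $l \notin A'$, it holds that $l \notin A$ and $W_T(\Omega_{A'}) - W_T(\Omega_A) = 0$;

3) For $l \in A'$, $l \notin A$, there exists at least one channel m such that $\omega'_l \ge \omega_m \ge \omega_l$. It then
holds that

$$0 \le c \cdot (\omega'_l - \omega_m)\, g'_{\min} \Delta_{\min} \le W_T(\Omega_{A'}) - W_T(\Omega_A) \le c \cdot [g(\omega'_l) - g(\omega_m)] \Delta_{\max}$$
$$\le c \cdot [g(\omega'_l) - g(\omega_l)] \Delta_{\max} \le c \cdot (\omega'_l - \omega_l)\, g'_{\max} \Delta_{\max}.$$

Therefore, Lemma 4 holds for slot T.

Assume that Lemma 4 holds for $T, \cdots, t + 1$. We now prove the lemma for slot t.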

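We first prove the first case: $l \in A'$ and $l \in A$. By rewriting $\Gamma(\Omega_A(t))$ in (4) and developing
$\omega_l(t+1)$ in $\Omega(t+1)$, we have:

$$\Gamma(\Omega_{A'}) = (1 - \epsilon)\omega'_l(t)\, \Gamma(\Omega^{1}_{A'}) + (1 - (1 - \epsilon)\omega'_l(t))\, \Gamma(\Omega^{\varphi(\omega'_l)}_{A'}) \qquad (5)$$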



March 4, 2013 DRAFT

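$$\Gamma(\Omega_A) = (1 - \epsilon)\omega_l(t)\, \Gamma(\Omega^{1}_{A}) + (1 - (1 - \epsilon)\omega_l(t))\, \Gamma(\Omega^{\varphi(\omega_l)}_{A}) \qquad (6)$$

where $\Omega^{1}_{A'}$ and $\Omega^{\varphi(\omega'_l)}_{A'}$ denote $\Omega_{A'}$ with $\omega'_l(t) = 1$ and $\varphi(\omega'_l)$,
respectively, while $\Omega^{1}_{A}$ and $\Omega^{\varphi(\omega_l)}_{A}$ denote $\Omega_A$ with $\omega_l(t) = 1$ and
$\varphi(\omega_l)$, respectively.
Noticing $\Omega^{1}_{A'} = \Omega^{1}_{A}$, we have

$$\Gamma(\Omega_{A'}) - \Gamma(\Omega_A) = (1 - \epsilon)(\omega'_l(t) - \omega_l(t)) [\Gamma(\Omega^{1}_{A'}) - \Gamma(\Omega^{\varphi(\omega'_l)}_{A'})] + (1 - (1 - \epsilon)\omega_l(t)) [\Gamma(\Omega^{\varphi(\omega'_l)}_{A'}) - \Gamma(\Omega^{\varphi(\omega_l)}_{A})].$$

Considering the whole realization of the belief vector, we further have

$$\Gamma(\Omega_{A'}(t)) - \Gamma(\Omega_A(t)) = \sum_{\mathcal{E} \subseteq A(t) \setminus \{l\}} \prod_{i \in \mathcal{E}} (1 - \epsilon)\omega_i(t) \prod_{j \in A(t) \setminus \mathcal{E} \setminus \{l\}} [1 - (1 - \epsilon)\omega_j(t)]$$
$$\times \Big\{ (1 - \epsilon)(\omega'_l(t) - \omega_l(t)) [W_{t+1}(\Omega_{l=1}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1))]$$
$$+ (1 - (1 - \epsilon)\omega_l(t)) [W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega_l)}(t+1))] \Big\} \qquad (7)$$

where $\Omega_{l=a}(t+1)$ ($a \in \{1, \varphi(\omega'_l), \varphi(\omega_l)\}$) denotes the belief vector at slot t + 1 under
$\Omega(t)$ with $\omega_l(t+1) = \mathcal{T}_l(a)$.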

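Next, we derive the bound of $W_{t+1}(\Omega_{l=1}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1))$ through three cases³:

• Case 1: if $l \in A'(t+1)$ and $l \in A(t+1)$, according to the induction hypothesis, we have

$$0 \le c \cdot (p^{l}_{11} - \mathcal{T}_l(\varphi(\omega'_l)))\, g'_{\min} \Delta_{\min} \le W_{t+1}(\Omega_{l=1}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1))$$
$$\le c \cdot (p^{l}_{11} - \mathcal{T}_l(\varphi(\omega'_l)))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i;$$

• Case 2: if $l \notin A'(t+1)$ and $l \notin A(t+1)$, according to the induction hypothesis, we have

$$0 \le W_{t+1}(\Omega_{l=1}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1)) \le c \cdot (p^{l}_{11} - \mathcal{T}_l(\varphi(\omega'_l)))\, g'_{\max} \Delta_{\max} \sum_{i=1}^{T-t-1} \beta^i (\delta^{\max})^i;$$

• Case 3: if $l \in A'(t+1)$ and $l \notin A(t+1)$, according to the induction hypothesis, we have

$$0 \le W_{t+1}(\Omega_{l=1}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1)) \le c \cdot (p^{l}_{11} - \mathcal{T}_l(\varphi(\omega'_l)))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i.$$

Combining the three cases, we obtain

$$0 \le W_{t+1}(\Omega_{l=1}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1))$$
$$\le c \cdot (p^{l}_{11} - \mathcal{T}_l(\varphi(\omega'_l)))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i$$
$$= c \cdot \left( 1 - \frac{\epsilon\omega'_l}{1 - (1 - \epsilon)\omega'_l} \right) (p^{l}_{11} - p^{l}_{01})\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i. \qquad (8)$$

³ It can be noted that the case $l \notin A'(t+1)$ and $l \in A(t+1)$ is impossible.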

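According to Lemmas 1 and 2, we have $\mathcal{T}_l(\varphi(\omega'_l)) \ge \mathcal{T}_l(\varphi(\omega_l))$ when
$\omega'_l \ge \omega_l$. Thus we obtain the bounds of
$W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega_l)}(t+1))$ by a similar induction, as follows:

$$0 \le W_{t+1}(\Omega_{l=\varphi(\omega'_l)}(t+1)) - W_{t+1}(\Omega_{l=\varphi(\omega_l)}(t+1))$$
$$\le c \cdot (\mathcal{T}_l(\varphi(\omega'_l)) - \mathcal{T}_l(\varphi(\omega_l)))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i$$
$$= c \cdot \frac{\epsilon(\omega'_l - \omega_l)}{[1 - (1 - \epsilon)\omega'_l][1 - (1 - \epsilon)\omega_l]} (p^{l}_{11} - p^{l}_{01})\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i. \qquad (9)$$

Combining (7), (8) and (9) and recalling $p^{l}_{11} - p^{l}_{01} \le \delta^{\max}$, we have

$$\Gamma(\Omega_{A'}(t)) - \Gamma(\Omega_A(t)) \le c \cdot (\omega'_l - \omega_l)\, \delta^{\max} g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i.$$

Since $\Gamma(\Omega_{A'}(t)) - \Gamma(\Omega_A(t)) \ge 0$ and
$c \cdot (\omega'_l - \omega_l)\, g'_{\min} \Delta_{\min} \le F(\Omega_{A'}(t)) - F(\Omega_A(t)) \le c \cdot (\omega'_l - \omega_l)\, g'_{\max} \Delta_{\max}$,
we have

$$c \cdot (\omega'_l - \omega_l)\, g'_{\min} \Delta_{\min} \le W_t(\Omega_{A'}(t)) - W_t(\Omega_A(t))$$
$$= F(\Omega_{A'}(t)) - F(\Omega_A(t)) + \beta [\Gamma(\Omega_{A'}(t)) - \Gamma(\Omega_A(t))]$$
$$\le c \cdot (\omega'_l - \omega_l)\, g'_{\max} \Delta_{\max} + \beta \cdot c \cdot (\omega'_l - \omega_l)\, \delta^{\max} g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i$$
$$= c \cdot (\omega'_l - \omega_l)\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t} \beta^i (\delta^{\max})^i.$$

We thus complete the proof of the first part ($l \in A'$ and $l \in A$) of Lemma 4.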
Secondly, we prove the second case $l \notin A'$ and $l \notin A$. To this end, we have:

$$\Gamma(\Omega_A(t)) = \sum_{\mathcal{E} \subseteq A(t)} \prod_{i \in \mathcal{E}} (1 - \epsilon)\omega_i(t) \prod_{j \in A(t) \setminus \mathcal{E}} [1 - (1 - \epsilon)\omega_j(t)]\, W_{t+1}(\Omega_l(t+1)),$$

$$\Gamma(\Omega_{A'}(t)) = \sum_{\mathcal{E} \subseteq A(t)} \prod_{i \in \mathcal{E}} (1 - \epsilon)\omega_i(t) \prod_{j \in A(t) \setminus \mathcal{E}} [1 - (1 - \epsilon)\omega_j(t)]\, W_{t+1}(\Omega'_l(t+1)),$$

where $\Omega_l(t+1)$ and $\Omega'_l(t+1)$ are the belief vectors for slot t + 1 generated by $\Omega_A(t)$ and
$\Omega_{A'}(t)$ based on the belief update equation (1).

We distinguish the following four cases:

• If channel l is never chosen for $\Omega_l(t+1)$ and $\Omega'_l(t+1)$ from slot t + 1 to the end of the time
horizon of interest T, that is to say, $l \notin A'(r)$ and $l \notin A(r)$ for $t + 1 \le r \le T$, it is easy to see
that $\Gamma(\Omega_{A'}(t)) - \Gamma(\Omega_A(t)) = 0$, and furthermore $W_t(\Omega_{A'}(t)) - W_t(\Omega_A(t)) = 0$;

• There exists $t^0$ ($t + 1 \le t^0 \le T$) such that $l \notin A'(r)$ and $l \notin A(r)$ for $t + 1 \le r \le t^0 - 1$ while
$l \notin A'(t^0)$ and $l \in A(t^0)$. For this case, it holds that $A'(r) = A(r)$ for $t + 1 \le r \le t^0 - 1$, while
$A'(t^0)$ and $A(t^0)$ differ in one element; assume that $m \in A'(t^0)$ and $m \notin A(t^0)$. According to the
definition of the myopic policy, it follows that $\omega_l(t^0) \ge \omega_m(t^0)$ and $\omega'_l(t^0) \le \omega_m(t^0)$, which
leads to a contradiction since $\omega'_l(t+1) = \mathcal{T}_l(\omega'_l(t)) \ge \omega_l(t+1) = \mathcal{T}_l(\omega_l(t))$ leads
to $\omega'_l(t^0) \ge \omega_l(t^0)$ following Lemma 1. This case is thus impossible;
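• There exists $t^0$ ($t + 1 \le t^0 \le T$) such that $l \notin A'(r)$ and $l \notin A(r)$ for $t + 1 \le r \le t^0 - 1$ while
$l \in A'(t^0)$ and $l \in A(t^0)$. For this case, according to the induction hypothesis ($l \in A'$ and
$l \in A$), we have

$$0 \le W_{t^0}(\Omega'_l(t^0)) - W_{t^0}(\Omega_l(t^0)) \le c \cdot (\omega'_l(t^0) - \omega_l(t^0))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t^0} \beta^i (\delta^{\max})^i.$$

Noticing $t^0 \ge t + 1$, we have

$$0 \le W_{t+1}(\Omega'_l(t+1)) - W_{t+1}(\Omega_l(t+1)) \le c \cdot (p^{l}_{11} - p^{l}_{01})(\omega'_l(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i.$$

Furthermore,

$$0 \le W_t(\Omega_{A'}(t)) - W_t(\Omega_A(t)) = \beta (\Gamma(\Omega_{A'}(t)) - \Gamma(\Omega_A(t)))$$
$$\le \beta \cdot c \cdot (p^{l}_{11} - p^{l}_{01})(\omega'_l(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i$$
$$\le c \cdot (\omega'_l(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^{i+1} (\delta^{\max})^{i+1}$$
$$= c \cdot (\omega'_l(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=1}^{T-t} \beta^i (\delta^{\max})^i;$$

• There exists $t^0$ ($t + 1 \le t^0 \le T$) such that $l \notin A'(r)$ and $l \notin A(r)$ for $t + 1 \le r \le t^0 - 1$ while
$l \in A'(t^0)$ and $l \notin A(t^0)$. For this case, by the induction hypothesis ($l \in A'$ and $l \notin A$), we have

$$0 \le W_{t^0}(\Omega'_l(t^0)) - W_{t^0}(\Omega_l(t^0)) \le c \cdot (\omega'_l(t^0) - \omega_l(t^0))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t^0} \beta^i (\delta^{\max})^i.$$

Noticing that $t + 1 \le t^0$, we have

$$0 \le W_{t+1}(\Omega'_l(t+1)) - W_{t+1}(\Omega_l(t+1)) \le c \cdot (p^{l}_{11} - p^{l}_{01})(\omega'_l(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i.$$

Therefore, we have

$$0 \le W_t(\Omega_{A'}(t)) - W_t(\Omega_A(t))$$
$$\le \beta \cdot c \cdot (\omega'_l(t) - \omega_l(t))(p^{l}_{11} - p^{l}_{01})\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^i (\delta^{\max})^i$$
$$\le c \cdot (\omega'_l(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t-1} \beta^{i+1} (\delta^{\max})^{i+1}$$
$$= c \cdot (\omega'_l(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=1}^{T-t} \beta^i (\delta^{\max})^i.$$

Combining the above results, we complete the proof of the second part ($l \notin A'$ and $l \notin A$) of Lemma 4.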
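Last, we prove the third case $l \in A'(t)$ and $l \notin A(t)$. In this case, there must exist a channel m
such that $\omega'_l \ge \omega_m \ge \omega_l$, $l \in A'$ and $m \in A$. We then have

$$W_t(\Omega_{A'}(t)) - W_t(\Omega_A(t))$$
$$= W_t(\omega_1, \cdots, \omega'_l, \cdots, \omega_N) - W_t(\omega_1, \cdots, \omega_l = \omega_m, \cdots, \omega_N)$$
$$+ W_t(\omega_1, \cdots, \omega_l = \omega_m, \cdots, \omega_N) - W_t(\omega_1, \cdots, \omega_l, \cdots, \omega_N). \qquad (10)$$

According to the induction hypothesis ($l \in A'$ and $l \in A$), the first term on the right-hand side of (10)
can be bounded as follows:

$$0 \le W_t(\omega_1, \cdots, \omega'_l, \cdots, \omega_N) - W_t(\omega_1, \cdots, \omega_l = \omega_m, \cdots, \omega_N)$$
$$\le c \cdot (\omega'_l(t) - \omega_m(t))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t} \beta^i (\delta^{\max})^i. \qquad (11)$$

Meanwhile, the second term on the right-hand side of (10) is bounded by the induction hypothesis
($l \notin A'$ and $l \notin A$) as:

$$0 \le W_t(\omega_1, \cdots, \omega_l = \omega_m, \cdots, \omega_N) - W_t(\omega_1, \cdots, \omega_l, \cdots, \omega_N)$$
$$\le c \cdot (\omega_m(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=1}^{T-t} \beta^i (\delta^{\max})^i. \qquad (12)$$

Therefore, combining (10), (11) and (12), we have

$$0 \le W_t(\Omega_{A'}(t)) - W_t(\Omega_A(t)) \le c \cdot (\omega'_l(t) - \omega_l(t))\, g'_{\max} \Delta_{\max} \sum_{i=0}^{T-t} \beta^i (\delta^{\max})^i,$$

which completes the proof of the third part ($l \in A'$ and $l \notin A$) of Lemma 4. Lemma 4 is thus proven.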